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ABSTRACT 

We study the problem of quantization of discrete prob- 
ability distributions, arising in universal coding, as well 
as other applications. We show, that in many situations 
this problem can be reduced to the covering problem for 
the unit simplex. Such setting yields precise asymptotic 
characterization in the high-rate regime. Our main contri- 
bution is a simple and asymptotically optimal algorithm 
for solving this problem. Performance of this algorithm is 
studied and compared with several known solutions. 

1. INTRODUCTION 

The problem of coding of probability distributions sur- 
faces many times in the history of source coding. First 
universal codes, developed in late 1960s, such as Lynch- 
Davisson 11211 [9], combinatorial 128], and enumerative 
codes [7] used lossless encoding of frequencies of sym- 
bols in an input sequence. The Rice machine |24l . de- 
veloped in early 1970's, transmitted quantized estimate 
of variance of source's distribution. Two-step universal 
codes, developed by J. Rissanen in 1980s, explicitly esti- 
mate, quantize, and transmit parameters of distribution as 
a first step of the encoding process l25ll26ll . Vector quanti- 
zation techniques for two-step universal coding were pro- 
posed in 1 30 . 4- 1 . In practice, two-step coding was often 
implemented by constructing a Huffman tree, then encod- 
ing and transmitting this code tree, and then encoding and 
transmitting the data. Such algorithms become very popu- 
lar in 1980s and 1990s, and were used, for example, in ZIP 
archiver iTFTI . and JPEG image compression standard fl6l . 

In recent years, the problem of coding of distributions 
has attracted a new wave of interest coming from other 
fields. For example, in computer vision, it is now custom- 
ary to use histogram-derived descriptions of image fea- 
tures. Examples of such descriptors include SIFT ll20ll . 
SURF 1 1 1, and CHoG [2], differentiating mainly in a way 
the quantize histograms. Several other uses of coding of 
distributions are described in fPH . 

To the best of our knowledge, most related prior stud- 
ies were motivated by optimal design of universal source 
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codes |25), |26), |30|, El, Q3). In this context, quan- 
tization of distributions becomes a small sub-problem in 
a complex rate optimization process, and final solutions 
yield very few insights about it. 

In this paper, we study quantization of distributions as 
a stand-alone problem. In Section 2, we introduce nota- 
tion and formulate the problem. In Section 3, we study 
achievable performance limits. In Section 4, we propose 
and study an algorithm for solving this problem. In Sec- 
tion 5, we provide comparisons with other known tech- 
niques. Conclusions are drawn in Section 6. 

2. DESCRIPTION OF THE PROBLEM 

Let A = {r\, . . . , r m }, m < oo, denote a discrete set of 
events, and let Q m denote the set of probability distribu- 
tions over A: 



a, 



i 



(i) 



Let p € Q m be an input distribution that we need to en- 
code, and let Q c O m be a set of distributions that we 
will be able to reproduce. We will call elements of Q 
reconstruction points or centers in fi m . We will further 
assume that Q is finite \Q\ < oo, and that its elements are 
enumerated and encoded by using fixed-rate code. The 
rate of such code is R(Q) = log 2 \Q\ bits. By d(p,q) 
we will denote a distance measure between distributions 
P,g e Q m . 

In order to complete traditional setting of the quan- 
tization problem for distribution p 6 f2 m , it remains to 
assume that it is produced by some random process, e.g. 
a memoryless process with density 8 over f2 m . Then the 
problem of quantization can be formulated as minimiza- 
tion of the average distance to the nearest reconstruction 
point (cf. lfT4l Lemma 3.1]) 

d(Cl m ,6,R)= inf E pe o m minc?(p, q) , (2) 

QCf2 m p ^ e q£Q 
\Q\<2 R 

However, we notice that in most applications, best pos- 
sible accuracy of the reconstructed distribution is needed 
instantaneously. For example, in the design of a two-part 
universal code, sample-derived distribution is quantized 
and immediately used for encoding of this sample |26| . 
Similarly, in computer vision / image recognition appli- 
cations, histograms of gradients from an image are ex- 



tracted, quantized, and used right away to find nearest 
match for this image. 

In all such cases, instead of minimizing the expected 
distance, it makes more sense to design a quantizer that 
minimizes the worst case- or maximal distance to the near- 
est reconstruction point. In other words, we need to solve 
the following problem^ 

d*(fl m ,R) = inf max mmd(n.q). (3) 

Qcfi m P ea m qeQ 

\QK2 R 

We next survey some known results about it. 

3. ACHIEVABLE PERFORMANCE LIMITS 

We note that the problem (f3]i is purely geometric in nature. 
It is equivalent to the problem of covering of Q m with at 
most 2 R balls of equal radius. Related and immediately 
applicable results can be found in Graf and Luschgy lfl4l 
Chapter 10]. 

First, observe that £l m is a compact set in M m_1 (unit 
m — 1-simplex), and that its volume in R m_1 can be com- 
puted as follows [29|: 



1 
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(4) 



Here and below we assume that m ^ 3. 

Next, we bring result for asymptotic covering radius 
|[T4l Theorem 10.7] 



lim 2> 

R— >oo 



: d* (fi m , R) = C^r^A™- 1 ^™), (5) 



where C m _i > is a constant known as covering coeffi- 
cient for the unit cube 



C r, 



_! = inf 2" 

R^O 



-d*([0,l} m -\R). 



(6) 



The exact value of C m -i depends on a distance mea- 
sure d(p, q). For example, for norm 



doc(p,q) = 
it is known that 



max \pi - q t 
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Hereafter, when we work with specific L r - norms: 



d r (p,q) = \\p- q\\r = (5ZlPi - 



l/r 



(7) 



we will attach subscripts r to covering radius d*(.) and 
other expressions to indicate type of norm being used. 
By putting all these facts together, we obtain: 



The dual problem 

R(e) = inf log 2 \Q\ , 

may also be posed. The resulting quantity R(e) can be understood as 
Kolmogorov's e-entropy for metric space (Q m ,d) |18|. 




Figure 1 . Examples of type lattices and their Voronoi cells 
in 3 dimensions (m = 3, n = 1, 2, 3). 

Proposition 1. With R — > oo: 



Onm,fl)~§ m -Vi^2-^ (8) 
and more generally (for other L r -norms, r ^ 1): 

where C m -i r are some constants. 



(9) 



We further note, that with large m the leading term 
in (O turns into 



' - ■ (){— 2 



(m-l)! m 



(10) 



which is a decaying function of m. This highlights an in- 
teresting property and distinction of the problem of quan- 
tization of m-ary distributions, compared to quantization 
of the unit (m — l)-dimensional cube. 

Our next task is to design a practical algorithm for 
solving this problem. 

4. PRACTICAL ALGORITHM FOR CODING OF 
DISTRIBUTIONS 

4.1. Algorithm design 

Our algorithm can be viewed as a custom designed lattice 
quantizer. It is interesting in a sense that its lattice coin- 
cides with the concept of types in universal coding [8 1. 

4.1.1. Type Lattice 

Given some integer n ^ 1, define a lattice: 



Qr, 



<li 



= (U) 



where n, k%, . . . , k m £ Z + . Parameter n serves as a com- 
mon denominator to all fractions, and can be used to con- 
trol the density and number of points in Q n . 

By analogy with the concept of types in universal cod- 
ing [8j we will refer to distributions q G Q n as types. 
For same reason we will call Q n a type lattice. Several 
examples of type lattices are shown in Figure 1 . 

4.1.2. Quantization 

The task of finding the nearest type in Q n can be solved 
by using the following simple algorithm: 0: 



This algorithm is similar in concept to Conway and Sloane's quan- 
tizer for lattice A n f5\ Chapter 20]. It works within (m-l) simplex. 



Algorithm 1. Given p, n, find nearest q = . . . , ^-j : 
7. Compute values (i = 1, . . . , m) 

2. If n' = n the nearest type is given by: ki — k[. 
Otherwise, compute errors 

5 i = k' i - npi , 

and sort them such that 



3. Let A = n' — n. If A > then decrement d values 
k[ with largest errors 

j =£,..., to- A - 1, 
1 i = to — A, . . . , to , 



otherwise, if A < increment \A\ values k[ with 
smallest errors 

1 i = l,...,|A|, 
i = |A| + l,...,m. 



The logic of this algorithm is obvious. It finds points 
that are nearest in terms of L-norms. By using quick- 
select instead of full sorting in step 2, its run time can 
be reduced to 0{m). 

4.1.3. Enumeration and Encoding 

As mentioned earlier, the number of types in lattice Q n 
depends on the parameter n. It is essentially the number 
of partitions of n into m terms kx + . . . + k m = n: 

■fm- T 



\Qn\ = 



m — 1 



(12) 



In order to encode a type with parameters k\ , . . . , k 
we need to obtain its unique index 
gest to compute it as follows: 

£(fci, . . . , k n ) — 

- i - J2]=i k i + m 
m — j — 1 



n-2 kj — X 

EE 

j=X i=0 



. . , km). We sug- 
(13) 

i-i 



This formula follows by induction (starting with m = 
2,3, etc.), and it implements lexicographic enumeration 
of types. For example: 

C(0,0,...,0,n) = 0, 
£(0,0,...,l,n-l) = 1, 

£(n,0,...,0,0) = r^T 1 )-!- 

Similar combinatorial enumeration techniques were dis- 
cussed in [28 , 7 , 27 1 . With precomputed array of binomial 
coefficients, the computation of an index by using this for- 
mula requires only 0(n) addition^ 

Once index is computed, it is transmitted by using its 
direct binary representation at rate: 



R(n) = log 2 



/ro+m— 1\ 
V m-1 ) 



(14) 



4 Considering that log |Q rI | = O(logra), this translates into at most 
0(n logn) bit-additions 



4.2. Analysis 

Type lattice Q n is related to so-called lattice A n in lattice 
theory [5 Chapter 4]. It can be understood as a bounded 
subset of A n with n = to — 1 dimensions, which is sub- 
sequently scaled, and placed in the unit simplex. 

Using this analogy, we can show that vertices of Voronoi 
cells (so called holes) in type lattice Q n are located at: 

qf=q + Vi, q e Q n , i = 1, . . . ,m - 1, (15) 

where Vi are so-called glue vectors J5] Chapter 21]: 



mi— i —i 
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ra—i times 



(16) 



We next compute maximum distances (covering radii). 
Proposition 2. Let a = |_to/2_|. The following holds: 

£Z™t doo{p > q) = (17) 



max min (iofp, a) 
pen m qeQ n 

max min di (p, q) 

P en m qeQ n v ' 



1 / a{rn — a) 



fi v m 
1 2q(m— a) 



n rn 



(18) 
(19) 



Proof. We use vectors ( TToT l. The largest component values 
appear when i — 1 or i = m — 1. E.g. for i = 1: 

vx = - [2S=l,=i,...,=il . 

- 1 - n L m ' m 1 1 m J 

This produces - radius. The largest absolute sum 
is achieved when all components are approximately the 
same in magnitude. This happens when i = a: 



a times m— a times 

This produces Lx - radius. L2 norm is the same for all 
vectors Uj, i > 0. □ 

It remains to evaluate distance / rate characteristics of 
type-lattice quantizer: 

d*\Q n ](fl m , R) = min max mind r (»,fl). 

n:|Q„|<2«penm?e<2n 



We report the following. 

Theorem 1. Let a = [m/2\. Then, with R — > 00: 

I _ J_ 

d*oo[Qn](tt m , R) ~ 2— 



<5[QT.](nm,i2) ~ 2- 

dt[Q„](O miJ R) - 2" 



-V(m-l)! 

a(rn— a) 
m 

-^(m-l)! 

2a(m — a) 
m 



, (20) 



, (21) 



■ (22) 



Proof. We first obtain asymptotic (with n — > oo) expan- 
sion for the rate of our code ( fT4l >: 

E=(m-l)log 2 n-log 2 (m-l)! + 0(i) . 

This implies that 

n ~ 2^r m -{/{m- 1)!. 

Statements of theorem are obtained by combination of this 
relation with expressions ( [T7lfT9b . □ 

4.2.1. Optimality 

We now compare the result of Theorem 1 with theoretical 
asymptotic estimates for covering radius for Vt m ([8] 
As evident, the maximum distance in our scheme decays 
with the rate R as: 

d*[Q n ](n m ,R)~2-^, 

which matches the decay rate of theoretical estimates. 

The only difference is in a constant factor. For exam- 
ple, under norm, such factor in expression ^ is 



Our algorithm, on the other hand, uses a factor 



n=2, dual lattice 



1 



S$ 1 



1 



< 1, 



which starts with i when m = 2. This suggests that even 
in terms of leading constant our algorithm is close to the 
optimal. 

4.2.2. Performance in terms of KL-distance 

All previous results are obtained using L-norms. Such dis- 
tance measures are common in computer vision applica- 
tions ll20l l22l [Tl . In source coding, main interest presents 
Kullback-Leibler (KL) distance: 



dKh(p,q) = D(p\\q) 



Pi log; 



ft 
Qi 



(23) 



It is not a true distance, so the exact analysis is compli- 
cated. Yet, by using Pinsker inequality j 



we can at least show that for deep holes 

d K L(q ,q) > 21^ m ) 

By translating n to bitrate, we obtain 

2a(m— a) 



(24) 



(25) 



More precise bounds can be obtained by using inequalities 
described in [ 10]. 




Figure 2. Two 10-point lattices: Q3 (left), and Q* 2 (right). 
The second has much smaller cells. 

4.3. Additional improvements 

4.3.1. Introducing bias 

As easily observed, type lattice Q n places reconstruction 
points with fcj = precisely on edges of the probability 
simplex Q m . This is not best placement from quantization 
standpoint, particularly when n is small. This placement 
can be improved by using biased types: 



n + 13m 



where (3 ^ is a constant that defines shift towards the 
middle of the simplex. In traditional source coding ap- 
plications, it is customary to use (3 = 1/2 lfl9l . In our 
case, setting f3 = 1/m appears to work best for L-norms, 
as it introduces same distance to edges of the simplex as 
covering radius of the lattice. 

Algorithm 1 can be easily adjusted to find nearest points 
in such modified lattice. 

4.3.2. Using dual type lattice Q* n 

Another idea for improving performance of our quantiza- 
tion algorithm - is to define and use dual type lattice Q* . 
Such a lattice consists of all points: 

q* = q + Vi, q G Q n , q* G O m i = 0, . . . ,m - 1, 

where v\ are the glue vectors dT6l >. 

The main advantage of using dual lattice would be 
thinner covering at high dimensions (cf. Chapter 2]). 
But even at small dimensions, it may sometimes be use- 
ful. An example of this for m = 3 is shown in Figure. 2. 

5. COMPARISON WITH OTHER TECHNIQUES 

Given a probability distribution p G fi m , one popular in 
practice way of compressing it is to design a prefix code 
(for example, a Huffman code) for this distribution p first, 
and then encode the binary tree of such a code. Below 
we summarize some known results about performance of 
such schemes. 

5.1. Performance of tree-based quantizers 

By denoting by l\ , . . . , l m lengths of prefix codes, recall- 
ing that they satisfy Kraft inequality [6], and noting that 
2~ li can be used to map lengths back to probabilities, we 
arrive at the following set: 



Qtrcc — {[qi , 



\q, 



= 2 



-t-i 



1} 



m=5 



m=10 




Figure 3. Maximal L x distances vs rate characteristics d^[H](_R), dj[GM](_R), dJ[Q n ](i?) achievable by Huffman-, 
Gilbert-Moore-, and type-based quantization schemes. 



There are several specific algorithms that one can em- 
ploy for construction of codes, producing different sub- 
sets of Qticc- Below we only consider the use of classic 
Huffman and Gilbert-Moore iTOl codes. Some additional 
tree-based quantization schemes can be found in ifTTl . 

Proposition 3. There exists a set Qgm C Qtroo, such that 

cZkl[Qgm](#gm) «S 2, (26) 
WgmK^gm) < 2x^2, (27) 
OQgmK^gm) «S 1, (28) 

where 

R GM = log 2 |Qgm| = log 2 C m _i (29) 
= 2m- |log 2 m + 0(l), 

where C„ = —tt ( 2n ) is the Catalan number. 

Proof. We use Gilbert-Moore code 1T31 . Upper bound for 
KL-distance is well known lfl3l . L\ bound follows by 
Pinsker's inequality d24l i. L M bound is obvious: pi,qi G 
(0, 1). Gilbert-Moore code uses fixed assignment (e.g. 
from left to right) of letters to the codewords. Any binary 
rooted tree with m leaves can serve as a code. The number 
of such trees is given by the Catalan number C m _ j . □ 

Proposition 4. There exists a set Qh C Qh, such that 

^kJOhK^h) < 1, (30) 

dt[Qn)](Rn) < y/2hrt, (31) 

dUQnKRn) < §> (32) 

where 

i? H = log 2 |Qh| = mlog 2 to + O (m) . (33) 



Proof. We use Huffman code. Its KL-distance bound is 
well known [6 j. L\ bound follows by Pinsker's inequal- 
ity. Loo bound follows from sibling property of Huffman 
trees iTPH . It remains to estimate the number of Huff- 
man trees T m with m leaves. Consider a skewed tree, 
with leaves at depths 1, 2, . . . , m — 1, m — 1. The last 
two leaves can be labeled by (™) combinations of letters, 
whereas the other leaves - by (m — 2)! possible combi- 
nations. Hence T m > _ 2 )! = ±m!. Upper 
bound is obtained by arbitrary labeling all binary trees 
with m leaves: T m < m!C m _i, where C m _i is the 
Catalan number. Combining both we obtain: — j^m < 
log 2 T m - m log 2 m < (2 - m. □ 

5.2. Comparison 

We present comparison of maximal L\ distances achiev- 
able by tree-based and type-based quantization schemes 
in Figure 3. We consider cases of to = 5 and to = 10 
dimensions. It can be observed that the proposed type- 
based scheme is more efficient and much more versatile, 
allowing a wide range of possible rate/distance tradeoffs. 

6. CONCLUSIONS 

The problem of quantization of discrete probability distri- 
butions is studied. It is shown, that in many cases, this 
problem can be reduced to the covering radius problem 
for the unit simplex. Precise characterization of this prob- 
lem in high-rate regime is reported. A simple algorithm 
for solving this problem is also presented, analyzed, and 
compared to other known solutions. 
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